Introduction

Influences and inspirations

I have loved music since the day I was created. I love to be surrounded by glorious sounds and music I can really connect with. I’m by no means a musician or a sound engineer; I’m just a simple lover and enthusiast of listening to (and sometimes singing along with) music.

My family greatly influenced this musical enthusiasm. I grew up in a family full of music listening enthusiasts and “ninja singers” (people who sing when they think no one is around or can hear them!). My Dad influenced me with his eclectic collection of classic rock and famous mainstream songwriters from the ’60s through to the ’80s. From my Dad, I learned very early on to appreciate the crystal-clear sound vinyl records offer as well as quality stereo equipment. My Mum influenced me with her regular listening to radio stations, in particular Gold 104 FM (Melbourne, Australia), and random sing-alongs in the car. My brother influenced me with his electrical engineering talents. I distinctly remember, as a young kid, the day he created his first car-boot boom box. Thundering rumbles of finely tuned bass could be heard from three blocks away in tranquil suburbia, and of course it was ’90s techno pop, electronica and occasionally Jamiroquai!

So how did all this intriguing musical enthusiasm influence me?

I was always drawn to classic and hard rock. I’ve always been fascinated with the interaction of sound compositions and poetic lyrical prose. Being an athlete since I was 9, I recognised early on how sounds and lyrics could influence my performances. There’s music to psych you up or calm you down, and rhymes to help you stay in time!

I discovered my fan-girl love for Irish rock group U2 when I was 14. It was their album “All That You Can’t Leave Behind” which got my attention. I then did what any obsessed fan-girl would do. I researched everything humanly possible about their music, and in addition to knowing every lyric to every song U2 ever wrote, I consequently ended up learning a great deal about many music-related and life-experience topics. Things like the guitars and complex sound effects The Edge uses, Bono’s public speaking, songwriting talents and stage presence, Larry Mullen Junior’s masterful methods of staying out of the public spotlight while protecting his band, and Adam Clayton’s talents for creative collaboration and sophisticated, artistic conversation.

As the years moved on, I began to open up and diversify my music collection. I have my husband to thank in part for this, introducing me to heavy metal and modern electronic genres. These days, my music interest is eclectic. In any single day I enjoy listening to anything between Eminem and Enya, Rammstein to Bobby McFerrin, Tibetan Buddhist monk chanting to Dragonforce’s Through the Fire and Flames. Yes, I’m also a huge fan of the Audiosurf game series, and Guitar Hero!

 

The purpose and objective for this analysis

The data of music and unpacking personal music preferences

Not many days go by where I’m not listening to music. I spend hours every day working away while listening to music, many hours creating an inadvertent fashion trend with my headphones, and many hours with my thoughts randomly jumping in with playful musical commentary.

More recently, this thought pattern commentary has been circulating in my head while listening to my personal collection of music:

  1. “This is such an eclectic playlist; how do you make sense of such a diverse collection of music? Can we prove this is truly an eclectic playlist?”
  2. “How on earth are these pieces of music even related or connected? Yes, Joshua Tree, I’m referring to you; I still haven’t found what I’m looking for, and it’s not Bullet the Blue Sky!”
  3. “What makes this song unique, special, similar to and different from another song in your playlist? Surely Led Zeppelin’s Kashmir is some kind of random outlier? What makes this song so impactful?”
  4. “Does this album actually tell a story or have a meaning?”
  5. “Is it possible to obtain an understanding of a music album through only observing its lyrics, having never actually listened to it?”
  6. “Hey Bree, you are a data scientist by trade. Music IS data, could you do something useful and insightful with it to find some answers?”

On that last comment, my inner kid sparked up and went “ooh, this could be so much fun, we can learn, write, and do an analysis with sound engineering, signal processing, text mining and natural language processing”. My logical mind very quickly realised that covering all of those topics in a single article wouldn’t do any of them justice. I would need to break these topics down into separate articles. So for this article, the purpose is to demonstrate with great love:

A comprehensive music lyrical analysis!

Working purely with the lyrical components for a selected collection of music album data, I will apply contemporary text mining and natural language processing techniques to this data to explore the above questions.

This analysis is aimed at the general population of music enthusiasts, data scientists, data-folk, and anyone who’s ever wondered what on earth Led Zeppelin’s Kashmir was about.

 

Technical

Guiding questions for this analysis

As we flow through this analysis, I’m seeking to weave in my responses to five key technical questions relevant to text mining and natural language processing:

  1. How will I acquire, structure and group the data?
  2. How can I visualise the data to tell a story and help make meaningful sense of the data?
  3. What are the important key words and n-gram constructs associated with each artist’s album?
  4. What are the sentiments and emotional representations of each song and album?
  5. What topics emerge from the albums? Can machine learning be useful here?

 

Project workflow

This project will be written in R, with the Github repository link here.

Sequencing:

1. Data Considerations

    a. Intellectual property & copyrights
    b. Artist and album selection
    c. Obtaining lyrics
    d. Converting lyric sheets into a useful dataset
    e. Data cleaning & pre-processing
    f. Additional data: Using the Spotify API
    g. Observing the specific and unique nature of music lyrics in a text analysis context
    h. Variable names and the source dataset

2. Manual Data Exploration

    a. Creating tokens & data wrangling
    b. Identifying vocal and instrumental songs
    c. Word counts by song & album
    d. Wordclouds
    e. Lexical diversity (vocabulary)
    f. Song lyrics self-similarity matrices (SongSim) & repetition
    g. Term Frequency - Inverse Document Frequency (TF-IDF)

3. Sentiment Analysis & Natural Language Processing (NLP)

    a. NRC emotional sentiment
    b. N-grams, bi-grams & tri-grams
    c. Bi-gram network analysis
    d. Pair-wise comparisons
    e. Album similarity
    f. Song dissimilarity (agreement between lyrics)

4. Unsupervised Machine Learning

    a. Topic modelling: Structural Topic Modelling (STM)

5. Findings & Learnings

    a. Findings
    b. Learnings, gotchas, traps for young players
    c. Where to next & part 2

6. References

 

1. Data Considerations

a. Intellectual property & copyrights

Because we are working with copyrighted material, namely the music lyrics, I’d like to make it known that the copyright belongs to the artists and songwriters who created the songs and the lyrics.

 

b. Artist and album selection

After much deliberation and consultation with friends and family, I decided to select 6 artists and one album from each of them. The selection (cherry-picking) of these artists is based on the human perception that, when viewed as a collection, this group of artists would be considered an “eclectic” grouping. This analysis will test that perception and, by its end, converge on a conclusion.

 

The selection, with each album’s release year, genre, track listing, review tags, and notes & rationale for selection:

Daft Punk - Discovery (2001) - Electronic

  1. One More Time
  2. Aerodynamic
  3. Digital Love
  4. Harder, Better, Faster, Stronger
  5. Crescendolls
  6. Nightvision
  7. Superheroes
  8. High Life
  9. Something About Us
  10. Voyager
  11. Veridis Quo
  12. Short Circuit
  13. Face to Face
  14. Too Long

Review tags: Electronic, House, Rock, Techno, Funk, Modern Disco

Notes: Aerodynamic, Crescendolls, Nightvision, Superheroes, High Life, Voyager, Veridis Quo and Short Circuit are instrumental songs. We will include these for sound analysis in a subsequent exploration of music analysis. The music video movie “Interstella 5555: The Story of the Secret Star System” is the visual realisation of Discovery: http://www.imdb.com/title/tt0368667/. Bree’s favourite Daft Punk album.

U2 - The Joshua Tree (1987) - Rock

  1. Where the Streets Have No Name
  2. I Still Haven’t Found What I’m Looking For
  3. With or Without You
  4. Bullet the Blue Sky
  5. Running to Stand Still
  6. Red Hill Mining Town
  7. In God’s Country
  8. Trip Through Your Wires
  9. One Tree Hill
  10. Exit
  11. Mothers of the Disappeared

Review tags: Rock, Gospel

Notes: One of the biggest-selling albums of all time. The 30th anniversary of its release in 2017 saw U2 take it on tour.

Elton John - Honky Chateau (1972) - Rock

  1. Honky Cat
  2. Mellow
  3. I Think I’m Going to Kill Myself
  4. Susie (Dramas)
  5. Rocket Man (I Think It’s Going to Be a Long, Long Time)
  6. Salvation
  7. Slave
  8. Amy
  9. Mona Lisas and Mad Hatters
  10. Hercules

Review tags: Rock, Pop, Rock & Roll

Notes: Rolling Stone believes this was the album which marked the transformation of Elton John from gentle singer/songwriter to a legitimate rock star: https://www.rollingstone.com/music/pictures/readers-poll-the-10-best-elton-john-albums-20130918/5-honky-chateau-3-78-9

Killswitch Engage - Alive or Just Breathing (2002) - Metalcore

  1. Numbered Days
  2. Self Revolution
  3. Fixation on the Darkness
  4. My Last Serenade
  5. Life to Lifeless
  6. Just Barely Breathing
  7. To The Sons of Man
  8. Temple from Within
  9. The Element of One
  10. Vide Infra
  11. Without a Name
  12. Rise Inside
  13. When the Balance is Broken

Review tags: Hardcore Metal

Notes: Loudwire suggests this is the greatest album KsE produced: http://loudwire.com/killswitch-engage-albums-ranked/. Bree has not yet listened to this album.

Iron Maiden - Powerslave (1984) - Heavy Metal

  1. Aces High
  2. Two Minutes to Midnight
  3. Losfer Words (Big ’Orra)
  4. Flash of the Blade
  5. The Duellists
  6. Back in the Village
  7. Powerslave
  8. Rime of the Ancient Mariner

Review tags: Hard Rock, Heavy Metal

Notes: LouderSound suggests this is the greatest album Iron Maiden produced: https://www.loudersound.com/features/every-iron-maiden-album-ranked-from-worst-to-best. Bree has not yet listened to this album.

Led Zeppelin - Physical Graffiti (1975) - Rock

  1. Custard Pie
  2. The Rover
  3. In My Time of Dying
  4. Houses of the Holy
  5. Trampled Under Foot
  6. Kashmir
  7. In the Light
  8. Bron-Yr-Aur

Review tags: Hard Rock, Heavy Metal

Notes: Bree wants to inspect Kashmir further. Bron-Yr-Aur is an instrumental.

 

c. Obtaining lyrics

I made a conscious decision to download the lyrics for the selected albums manually. This was achieved by googling, finding the correct lyric sheets, and copying and pasting them into Microsoft Word. I could just as easily have set up an R script to web-scrape, but I wanted to observe and have full control over the state and structure in which the lyrics were stored, and to apply some very specific formats for a reason I will explain in the next section.
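For reference, the web-scraping route I chose not to take would look something like this sketch using the rvest package. The URL and the ".lyrics" CSS selector below are hypothetical placeholders, not a real lyrics source; a real site would need its own selector, and its terms of use checked first.

```r
# A sketch of the web-scraping alternative (not the approach used here).
# The URL and ".lyrics" selector are hypothetical placeholders.
scrape_lyrics <- function(url, selector = ".lyrics") {
  page <- rvest::read_html(url)               # fetch and parse the page
  nodes <- rvest::html_elements(page, selector) # locate the lyric container(s)
  rvest::html_text2(nodes)                     # extract text, preserving line breaks
}

# Example (not run):
# scrape_lyrics("https://example.com/lyrics/u2/exit", ".lyrics")
```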

 

d. Converting lyric sheets into a useful dataset

For the source dataset we’re creating, we want to maintain the “structural integrity” of the music lyrics as we import them into data frames. This means our dataset will look the way the lyric sheet looks: rows for the album title, rows for song titles, and rows for each verse and each chorus.

Why?

Music is all about patterns, sequences and time signatures. Lyrics are written differently to many other forms of written communication, and they are interwoven with sound, else they would just be poetry or a short story. It would be inappropriate at this point to treat the lyrics as one giant “word pool”. We need to maintain hierarchical groupings and identify which lyrics belong with which artist and album. Let’s first observe the lyrics in a structure as close as possible to how they were originally written.

How do we do this?

Let’s keep the lyric sheets in .docx format for now and check the integrity of the style formats. There’s magic in the little “¶” button on the paragraph menu pane in Microsoft Word.

For each lyric sheet we have, turn on paragraph marks and make the following adjustments:

  • Configure styles for Heading 1, Heading 2, Heading 3, Normal body and spacing after paragraph
  • Heading 1 – Artist name
  • Heading 2 – Album name
  • Heading 3 – Song title name
  • Song lyrics – Normal body

Each verse and chorus is separated by a paragraph mark. Individual lines within verses and choruses are separated with a carriage return (Enter key), with a single space at the end of each line.

Separate each style component used with a single paragraph mark. Paragraph marks will create new rows in our dataset, as will changes of style.

Where a song is instrumental, use "[Instrumental]" as a consistent tag across all lyric sheets for ease of flagging these songs for filtering later.

Here is an example of the lyric sheet for U2 - The Joshua Tree and reading it in to create a dataframe.

 

 

We can use the officer and qdapTools packages to read in the collection of lyric sheet documents:

# Working with Word documents (docx):
# Read in each file and use docx_summary() to map the styles used in each document as records in a dataframe.
# Then apply some common hierarchical groupings: Artist, Album, Track Name, and add in verse line counters.
# Note: F() is assumed to be the project's file-path helper.
library(officer)
library(qdapTools)
library(zoo)
library(dplyr)

doc02 <- read_docx(F("Data/Raw/U2 - The Joshua Tree.docx"))
raw.data02 <- docx_summary(doc02) %>%
  mutate(CATMusicArtist = "U2",
         CATMusicAlbum = "The Joshua Tree",
         CATTrackName = ifelse(style_name == "heading 3", text, NA_character_),
         CATTrackName = zoo::na.locf(CATTrackName, na.rm = FALSE), #fill the track name down to its lyric rows
         style_name = ifelse(is.na(style_name), "body", style_name)
  ) %>%
  group_by(CATTrackName) %>%
  mutate(NUMTrackLyricLineNumber = sequence(n()) - 1) #minus 1 so the song title heading itself is line 0

 

Our imported lyric sheet for U2 - The Joshua Tree has now become a nicely structured data frame. Here’s an example of the first five records:

 

 

 

We will repeat the above steps for the other Artists and Albums in our selection and append the datasets together.
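The repeat-and-append step can be sketched with toy data frames standing in for the real per-album imports (raw.data01 here is a hypothetical stand-in; raw.data02 corresponds to the U2 import):

```r
# Sketch of the append step: each album's imported lyric sheet becomes its own
# data frame, then they are stacked into one working dataset.
# The two frames below are toy stand-ins, not the real officer imports.
raw.data01 <- data.frame(CATMusicArtist = "Daft Punk",
                         CATMusicAlbum  = "Discovery",
                         text = c("One More Time", "One more time"),
                         stringsAsFactors = FALSE)
raw.data02 <- data.frame(CATMusicArtist = "U2",
                         CATMusicAlbum  = "The Joshua Tree",
                         text = c("Where the Streets Have No Name", "I want to run"),
                         stringsAsFactors = FALSE)

# Stack all per-album frames into the combined working dataset
wrk.data <- do.call(rbind, list(raw.data01, raw.data02))
nrow(wrk.data)  # 4 rows, one per imported paragraph
```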

 

e. Data cleaning & pre processing

Next, we will perform some data checks and cleanup using the qdap package.

# Using the dataset which contains all raw imported lyric sheets from the previous step:
# Tidy up the track name variable for when we need to use this as a merge key.
# Identify the songs which are instrumental; we will filter these out for this analysis.
library(qdap)

wrk.01_Data_Prep <- wrk.data %>%
  mutate(BINTrackIsInstrumental = ifelse(style_name == "body" & trimws(text) == "[Instrumental]", 1, 0)) %>%
  mutate(KEYTrackName = toupper(trimws(CATTrackName)))

# Use qdap's qview to identify suggestions for tidy-up. Results will be written to a text file.
qview(wrk.01_Data_Prep)
check_text(wrk.01_Data_Prep$text, file = F("Data/Raw/QDAPCheckText_wrk.01_Data_Prep.txt"))

# Applying qdap cleanup recommendations
wrk.01_Data_Prep$text <- replace_number(wrk.01_Data_Prep$text, num.paste = TRUE, remove = FALSE)
wrk.01_Data_Prep$text <- incomplete_replace(wrk.01_Data_Prep$text)
wrk.01_Data_Prep$text <- comma_spacer(wrk.01_Data_Prep$text)
wrk.01_Data_Prep$text <- clean(wrk.01_Data_Prep$text)
wrk.01_Data_Prep$text <- scrubber(wrk.01_Data_Prep$text, fix.comma = TRUE, fix.space = TRUE)
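For readers without qdap installed, a rough base-R analogue of this whitespace-and-comma cleanup might look like the sketch below. tidy_lyric_text is my own illustrative helper, not a qdap function:

```r
# A rough base-R analogue of the qdap cleanup above: fix spacing around
# commas, collapse repeated whitespace, and trim the ends of each line.
tidy_lyric_text <- function(x) {
  x <- gsub("\\s*,\\s*", ", ", x)  # one space after each comma, none before
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)                        # strip leading/trailing whitespace
}

tidy_lyric_text("  I still haven't found ,what I'm   looking for ")
# "I still haven't found, what I'm looking for"
```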

 

f. Additional data: Using the Spotify API

I thought it would be useful to explore alternative sources of data to complement the lyrical dataset we just created. In my exploration, I came across the spotifyr package with more details here. After setting up a Spotify developers account I was able to obtain the necessary “client ID” and “client secret” tokens, and set these as environment variables to use in the spotifyr functions.

 

#devtools::install_github('charlie86/spotifyr')
#install.packages('spotifyr')
library(spotifyr)
library(data.table) # for rbindlist
library(purrr)      # for reduce
library(feather)    # for write_feather
# Reference: https://github.com/charlie86/spotifyr

# App name "MusicLyricAnalysis1"
Sys.setenv(SPOTIFY_CLIENT_ID = "ADD TOKEN HERE")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "ADD TOKEN HERE")

access_token <- get_spotify_access_token()

# Extract data from Spotify
spotify_df_U2 <- get_artist_audio_features('U2', access_token)
spotify_df_DaftPunk <- get_artist_audio_features('Daft Punk', access_token)
spotify_df_EltonJohn <- get_artist_audio_features('Elton John', access_token)
spotify_df_LedZeppelin <- get_artist_audio_features('Led Zeppelin', access_token)
spotify_df_KillswitchEngage <- get_artist_audio_features('Killswitch Engage', access_token)
spotify_df_IronMaiden <- get_artist_audio_features('Iron Maiden', access_token)

spotify_U2_filtered <- filter(spotify_df_U2, album_name == "The Joshua Tree (Deluxe)") #we have extra live songs
spotify_DaftPunk_filtered <- filter(spotify_df_DaftPunk, album_name == "Discovery")
spotify_EltonJohn_filtered <- filter(spotify_df_EltonJohn, album_name == "Honky Chateau") #no results found
spotify_LedZeppelin_filtered <- filter(spotify_df_LedZeppelin, album_name == "Physical Graffiti")
spotify_KillswitchEngage_filtered <- filter(spotify_df_KillswitchEngage, album_name == "Alive or Just Breathing [Topshelf Edition]")
spotify_IronMaiden_filtered <- filter(spotify_df_IronMaiden, album_name == "Powerslave (1998 Remastered Edition)")

# We can keep these datasets in data frame format for now; dataset size is tiny.
raw.SpotifyArtistList <- list(spotify_U2_filtered, spotify_DaftPunk_filtered, spotify_EltonJohn_filtered,
                              spotify_LedZeppelin_filtered, spotify_KillswitchEngage_filtered, spotify_IronMaiden_filtered)
# Append all of the above raw datasets together
raw.SpotifyArtistAlbumTrackData <- rbindlist(raw.SpotifyArtistList) %>%
  mutate(KEYTrackName = toupper(trimws(track_name)))

# Save a feather file of the appended Spotify data
write_feather(raw.SpotifyArtistAlbumTrackData, F("Data/Raw/raw.SpotifyArtistAlbumTrackData.feather"))

# Merge the Spotify data onto the lyric dataset by the cleaned track name key
wrk.01_DataPrep_LyricsWithSpotify <- list(wrk.01_Data_Prep, raw.SpotifyArtistAlbumTrackData) %>%
  reduce(left_join, by = c("KEYTrackName" = "KEYTrackName"))

# Save a feather file of the merged lyrics and Spotify data
write_feather(wrk.01_DataPrep_LyricsWithSpotify, F("Data/Processed/wrk.01_DataPrep_LyricsWithSpotify.feather"))

 

Inspecting the Spotify data for the collection of albums:

# Check over the dataset
glimpse(raw.SpotifyArtistAlbumTrackData)
## Observations: 76
## Variables: 24
## $ album_uri          <chr> "2t4UTpa53ALkISHhiKrEtv", "2t4UTpa53ALkISHh...
## $ album_name         <chr> "The Joshua Tree (Deluxe)", "The Joshua Tre...
## $ album_img          <chr> "https://i.scdn.co/image/3dc58a6d1e838ff4d5...
## $ album_release_date <chr> "1987-03-03", "1987-03-03", "1987-03-03", "...
## $ album_release_year <date> 1987-03-03, 1987-03-03, 1987-03-03, 1987-0...
## $ album_popularity   <int> 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62,...
## $ track_name         <chr> "Where The Streets Have No Name", "I Still ...
## $ track_uri          <chr> "2IlT1DLSpmmHkHlAeuHMU3", "4GW8K6bDiiJGEgGP...
## $ danceability       <dbl> 0.4950, 0.5660, 0.5430, 0.3360, 0.5270, 0.3...
## $ energy             <dbl> 0.728, 0.783, 0.432, 0.653, 0.189, 0.679, 0...
## $ key                <chr> "D", "C#", "D", "G#", "D", "C", "E", "C", "...
## $ loudness           <dbl> -9.549, -9.412, -11.832, -10.210, -18.605, ...
## $ mode               <chr> "major", "major", "major", "major", "major"...
## $ speechiness        <dbl> 0.0385, 0.0363, 0.0288, 0.0499, 0.0293, 0.0...
## $ acousticness       <dbl> 0.011000, 0.015700, 0.000207, 0.006700, 0.8...
## $ instrumentalness   <dbl> 0.0035300, 0.0030000, 0.3690000, 0.4380000,...
## $ liveness           <dbl> 0.1510, 0.0806, 0.1460, 0.1360, 0.3340, 0.2...
## $ valence            <dbl> 0.2180, 0.5870, 0.1070, 0.4610, 0.2090, 0.3...
## $ tempo              <dbl> 125.810, 100.864, 110.196, 152.308, 94.642,...
## $ duration_ms        <dbl> 337506, 277477, 295516, 271547, 257366, 292...
## $ time_signature     <dbl> 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 4...
## $ key_mode           <chr> "D major", "C# major", "D major", "G# major...
## $ track_popularity   <int> 39, 41, 68, 34, 34, 33, 33, 31, 32, 29, 29,...
## $ KEYTrackName       <chr> "WHERE THE STREETS HAVE NO NAME", "I STILL ...

After attempting to extract the desired albums for the selected artists, I discovered that not all albums, or all songs from those albums, were available on Spotify. For example, Elton John’s Honky Chateau was not available to extract; only single tracks were available via a best-of album. For U2’s The Joshua Tree, the original release is not available, only the deluxe edition, which comes with extra live performance tracks. The situation is similar for Iron Maiden’s Powerslave.

With this information, I decided that the Spotify data is simply a “nice to have” and will not be central to this analysis. I would very much have liked for it to cover all artists, all albums and all songs in the collection.

We will leave the Spotify variables in the merged dataset, but we won’t use them in this analysis.

 

CheckSpotify <- select(raw.SpotifyArtistAlbumTrackData, album_name, track_name)
checkSpotify_NotMatchingSource <- anti_join(CheckSpotify, wrk.01_Data_Prep, by = c("track_name" = "CATTrackName")) #all Spotify songs not matching our source data
glimpse(select(checkSpotify_NotMatchingSource, album_name, track_name))
## Observations: 42
## Variables: 2
## $ album_name <chr> "The Joshua Tree (Deluxe)", "The Joshua Tree (Delux...
## $ track_name <chr> "Where The Streets Have No Name", "With Or Without ...

 

g. Observing the specific and unique nature of music lyrics in a text analysis context

A few things we can observe and acknowledge so far:

  • Some songs will be instrumental. The raw lyric sheets will only contain “[Instrumental]” for instrumental songs. We can filter these out of our analysis and re-use them in a subsequent analysis. Perhaps for signal or beat detection analysis!
  • Each lyric “line” is meaningful in the flow of a song. Each line can be linked to subsequent lines via rhyming and context. We expect the lyric sheets to loosely resemble poetry, and we expect a higher incidence of repetition, because music is full of patterns (unless we’re dealing with some kind of random jam session or jazz genre!).
  • The number of songs will vary per artist and album. Some albums will have more songs than others. We need to be mindful of any analysis utilising word counts or averages.
  • In reality, the lyrics are interwoven with an audio track; this analysis is like staring at words in segregated silence. The interwoven audio track adds the dimensions of time series, rhythm, verbal intonation, speech patterns and emotional sentiment, all of which lead us to interpret the lyrics differently from how we would when analysing them in isolation. Think for a moment: when you receive text-based emails or text messages, the emotions, sentiment and context of the communication can be very difficult to convey in text alone, while the same message transmitted as sound can be received very differently.

 

h. Variable names and the source dataset

Here is a summary of the source dataset we’ll use for this analysis. It is one row per album track. All lyrics for each track are contained within a single variable, TXTAllTrackLyrics. The end of each lyric line is signified with a “<br>” tag.

# Create a new dataframe with one row of lyrics for each track (instead of multiple rows per verse/chorus)
wrk.02_TextAnalysis_00 <- wrk.01_DataPrep_LyricsWithSpotify %>%
  group_by(CATTrackName) %>%
  mutate(TXTAllTrackLyrics = paste0(text, "<br>", collapse = " ")) %>% #each lyric line ends with a <br> marker
  mutate(NUMMaxLyricLines = max(as.numeric(NUMTrackLyricLineNumber)))

# Create a dataset with one row per song. One variable & record to hold all lyrics for a song.
wrk.02_TextAnalysis_01 <- wrk.02_TextAnalysis_00 %>%
  filter(style_name == "heading 3")

# Now remove songs which are purely instrumental
wrk.02_TextAnalysis_02 <- wrk.02_TextAnalysis_01 %>%
  filter(!str_detect(TXTAllTrackLyrics, "Instrumental"))

#Summary
glimpse(wrk.02_TextAnalysis_02)
## Observations: 59
## Variables: 37
## $ doc_index               <int> 3, 22, 29, 68, 78, 82, 3, 10, 15, 22, ...
## $ content_type            <chr> "paragraph", "paragraph", "paragraph",...
## $ style_name              <chr> "heading 3", "heading 3", "heading 3",...
## $ text                    <chr> "One More Time", "Digital Love", "Hard...
## $ level                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ num_id                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ CATMusicArtist          <chr> "Daft Punk", "Daft Punk", "Daft Punk",...
## $ CATMusicAlbum           <chr> "Discovery", "Discovery", "Discovery",...
## $ CATTrackName            <chr> "One More Time ", "Digital Love ", "Ha...
## $ NUMTrackLyricLineNumber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BINTrackIsInstrumental  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ KEYTrackName            <chr> "ONE MORE TIME", "DIGITAL LOVE", "HARD...
## $ album_uri               <chr> "2noRn2Aes5aoNVsU6iWThc", "2noRn2Aes5a...
## $ album_name              <chr> "Discovery", "Discovery", NA, "Discove...
## $ album_img               <chr> "https://i.scdn.co/image/1a9dab25976c7...
## $ album_release_date      <chr> "2001-03-07", "2001-03-07", NA, "2001-...
## $ album_release_year      <date> 2001-03-07, 2001-03-07, NA, 2001-03-0...
## $ album_popularity        <int> 76, 76, NA, 76, 76, 76, 62, 62, 62, 62...
## $ track_name              <chr> "One More Time", "Digital Love", NA, "...
## $ track_uri               <chr> "0DiWol3AO6WpXZgp0goxAV", "2VEZx7NWsZ1...
## $ danceability            <dbl> 0.611, 0.644, NA, 0.875, 0.874, 0.691,...
## $ energy                  <dbl> 0.697, 0.664, NA, 0.475, 0.437, 0.582,...
## $ key                     <chr> "D", "A", NA, "A", "C#", "F", "D", "C#...
## $ loudness                <dbl> -8.618, -8.398, NA, -12.673, -10.234, ...
## $ mode                    <chr> "major", "major", NA, "minor", "minor"...
## $ speechiness             <dbl> 0.1330, 0.0333, NA, 0.0986, 0.0706, 0....
## $ acousticness            <dbl> 0.019300, 0.048100, NA, 0.440000, 0.00...
## $ instrumentalness        <dbl> 0.0000000, 0.8620000, NA, 0.7200000, 0...
## $ liveness                <dbl> 0.3320, 0.3420, NA, 0.0460, 0.0293, 0....
## $ valence                 <dbl> 0.4760, 0.5300, NA, 0.3840, 0.9630, 0....
## $ tempo                   <dbl> 122.752, 124.726, NA, 99.958, 117.790,...
## $ duration_ms             <dbl> 320357, 301373, NA, 232667, 240173, 60...
## $ time_signature          <dbl> 4, 4, NA, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ key_mode                <chr> "D major", "A major", NA, "A minor", "...
## $ track_popularity        <int> 75, 63, NA, 67, 60, 52, 39, 41, 68, 34...
## $ TXTAllTrackLyrics       <chr> "One More Time<br> One more time One m...
## $ NUMMaxLyricLines        <dbl> 16, 6, 30, 3, 3, 12, 6, 4, 6, 6, 3, 8,...

From this dataset, the most important variables we will be using frequently will be:

  • doc_index, the primary key of the source dataset. This will become particularly useful for creating a corpus and performing n-gram analysis.
  • text & TXTAllTrackLyrics, usage will depend on the character string structure we need for text analysis functions.
  • CATMusicArtist, our main grouping variable.
  • CATMusicAlbum, our secondary grouping variable.
  • CATTrackName, most granular grouping variable.
  • BINTrackIsInstrumental, for filtering out songs with no lyrics.

 

2. Manual Data Exploration

a. Creating tokens & data wrangling

Our dataset is almost ready for exploration. There are a few remaining data wrangling steps to complete:

  1. Manually identify any “undesirable words” in the context of lyrics. Sometimes we see markers in the lyric sheets like “chorus”, “verse”, “repeat x3” etc.
  2. We have the option to remove stop words at this point, although I would like to leave them in for the initial exploration, purely to observe, and remove them when we begin more detailed text analysis.
  3. Create lyric line tokens: split the lyrics into individual lines.
  4. Create lyric word tokens: split the lyrics into individual words.
  5. Create a numeric counter of how many words per song, per album & artist.

 

library(tidytext)

#Identify any specific/customisable words we wish to eliminate later on
undesirable_words <- c("chorus", "lyrics", "verse")

#Create lyric line tokens using tidytext [dataset will be one row per lyric line]
lineToken <- wrk.02_TextAnalysis_02 %>%
  ungroup() %>%
  unnest_tokens(line, TXTAllTrackLyrics, token = stringr::str_split, pattern = '<br>') %>% #break the lyrics into lines
  mutate(lineCount = row_number()) #Create a line counter so we know which lyric line is which

# Create lyric word tokens and apply tidy text format [dataset will be one row per lyric word]
# We have the option to remove stop words at this point.
# Selecting not to do this yet, as we wish to visually observe the raw form of the dataset.
wordToken <- lineToken %>%
  unnest_tokens(word, line) %>% #Break the lyric lines into individual words
  # anti_join(stop_words) %>% #tidytext stop word removal (deferred for now)
  filter(!word %in% undesirable_words) #removing custom configured stop words

# Add in full word counts for each song (non-distinct)
wrk.02_TextAnalysis_03_WordCount <- wordToken %>%
  group_by(CATMusicArtist, CATMusicAlbum, CATTrackName) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words))

 

b. Identifying vocal and instrumental songs

We need to identify which songs are instrumental so we can exclude them from this analysis. Not surprisingly, more than half of the songs on Daft Punk’s Discovery are instrumental. This is significant: 8 of the album’s 14 songs will be flagged as instrumental, leaving only 6 songs available to analyse lyrically and compare with the other 5 selected artists.

 

Instrumental Songs to Exclude from Analysis
CATMusicArtist CATMusicAlbum CATTrackName
Daft Punk Discovery Aerodynamic
Daft Punk Discovery Crescendolls
Daft Punk Discovery Nightvision
Daft Punk Discovery Superheroes
Daft Punk Discovery High Life
Daft Punk Discovery Voyager
Daft Punk Discovery Veridis Quo
Daft Punk Discovery Short Circuit
Iron Maiden Powerslave Losfer Words (Big ’Orra)
Killswitch Engage Alive or Just Breathing Without A Name
Led Zeppelin Physical Graffiti Bron-Yr-Aur

 

c. Word counts by song & album

With our dataset now ready for exploration, let’s inspect these questions:

  • In total, how many songs with lyrics are available to work with?
  • What are the raw word counts for each of these songs?

We have 59 songs to work with, and Iron Maiden’s Rime of the Ancient Mariner has a very large word count at 650 words, significantly higher than any other song in our dataset.

Could this song’s lyrics skew any subsequent analysis we will be performing?
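One quick, informal way to check is a robust spread test on the per-song word counts. The counts below are illustrative placeholders, not the real values from the dataset, with 650 standing in for Rime of the Ancient Mariner:

```r
# Toy check: is a 650-word song an outlier relative to the rest?
# These counts are illustrative placeholders, not the real per-song counts.
word_counts <- c(650, 320, 180, 210, 150, 240, 300, 120, 260, 200)

med <- median(word_counts)
madev <- mad(word_counts)  # median absolute deviation, robust to the outlier itself

# Robust z-score: values beyond ~3 are conventionally flagged as outliers
robust_z <- (word_counts - med) / madev
word_counts[abs(robust_z) > 3]  # only the 650-word song is flagged
```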

 

 

While this table shows us the raw individual word counts for each song, it doesn’t clearly illustrate the Artist-Album groupings and total word counts for Artist-Album. Some questions here:

  • How many songs do we have for each artist & album?
  • What is the “wordiest” album in the collection? Could this be used as a basic indicator of the balance between storytelling and sound engineering or the audible experience?

We must be careful here about using total word counts. Since Daft Punk’s Discovery had 8 of its 14 songs excluded as instrumental, there are fewer songs available to contribute to an overall word count.

 

From this visualisation:

  • Daft Punk’s Discovery has 6 songs with lyrics, with a total word count of just under 1500.
  • Elton John’s Honky Chateau has 10 songs with lyrics, with a total word count of just over 1500.
  • Iron Maiden’s Powerslave has 7 songs with lyrics, with a total word count of just under 2000.
  • Killswitch Engage’s Alive or Just Breathing has 12 songs with lyrics, with a total word count of just under 1400.
  • Led Zeppelin’s Physical Graffiti has 13 songs with lyrics, with a total word count of nearly 3000.
  • U2’s The Joshua Tree has 11 songs with lyrics, with a total word count of just under 2000.

There’s not enough information here to draw any meaningful conclusions about storytelling or to prove the collection is eclectic. However, it is interesting to note the variance of word counts between songs. In general, it seems a song can be of any word length between 0 (instrumental) and 650 (or even more) words, and an album can have any number of songs. There is no consistency between the artists on either of these observations.

 

d. Wordclouds

We can visualise the raw word counts using word clouds. The intention here is to get an initial, basic view as to the common words for each artist & album.

We do need to be careful with the usage and context of these visualisations here because we know some songs were designed for repetition and certain words will dominate for this reason.

For this section I will use the wordcloud package to create the word clouds for each artist and album.
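A sketch of the wordcloud call for a single artist, assuming a data frame like the `wordToken` used earlier (one row per word, stop words removed); a toy stand-in is used here so the snippet is self-contained:

```r
library(dplyr)
library(wordcloud)

# Toy stand-in for the wordToken data frame (one row per word)
wordToken <- tibble::tibble(
  CATMusicArtist = "Daft Punk",
  word = c("harder", "better", "faster", "stronger", "harder", "faster"))

# Count word frequencies for one artist, then render the cloud
daft_counts <- wordToken %>%
  filter(CATMusicArtist == "Daft Punk") %>%
  count(word, sort = TRUE)

wordcloud(words = daft_counts$word, freq = daft_counts$n,
          min.freq = 1, random.order = FALSE)
```

In the real analysis this would be repeated (or looped) once per artist & album grouping.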

 

Daft Punk - Discovery

U2 - The Joshua Tree

Elton John - Honky Chateau

Led Zeppelin

Killswitch Engage

Iron Maiden

 

On initial observation of these word clouds:

  • All of these word clouds are visually different and unique! This view shows us, at broad face value, that the collection of artists, albums and songs in the dataset is an eclectic collection.
  • There is an observable difference in word choice & vocabulary used between artists. See Led Zeppelin’s relaxed usage of “mama” and “baby” versus the articulate words from Iron Maiden: “mariner”, “village”.
  • Elton John’s word cloud renders larger, with more clearly defined words, versus Daft Punk’s smaller cloud with less defined words.
  • Perceived similar (but not precisely the same) synonymous word usage between U2 and Killswitch Engage: “heart”, “eyes”, “life” versus “love”.

 

e. Lexical diversity (vocabulary)

Time to explore the depth of each song’s lyrical vocabulary, which we will refer to as “lexical diversity”.

A curious subjective question at this point is: could a larger vocabulary for a song (and therefore artist) be an indicator of great storytelling?

In calculating the lexical diversity we will:

  • Remove stop words
  • Work with a dataset which is one row per word (un-nested, token by word)
  • Group by Artist & Album
  • Count by distinct words used in each song
  • Visualise using a pirateplot box plot, from the yarrr package
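The steps above can be sketched as follows; the toy `wordToken` stand-in (stop words already removed) is purely illustrative:

```r
library(dplyr)
library(yarrr)

# Toy stand-in for wordToken: one row per word, stop words removed
wordToken <- tibble::tibble(
  CATMusicAlbum = rep(c("Discovery", "Powerslave"), each = 6),
  CATTrackName  = rep(c("Song A", "Song B", "Song C", "Song D"), each = 3),
  word = c("harder", "better", "harder", "time", "digital", "love",
           "mariner", "ancient", "rime", "aces", "high", "aces"))

# Distinct (unique) word count per song
lex_diversity <- wordToken %>%
  group_by(CATMusicAlbum, CATTrackName) %>%
  summarise(lex_div = n_distinct(word), .groups = "drop")

# Pirate plot: one dot per song, the horizontal line is the album average
pirateplot(lex_div ~ CATMusicAlbum, data = lex_diversity,
           xlab = "Album", ylab = "Distinct words per song")
```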

 

In this visualisation, each dot represents a song on the artist’s album, plotted by its total distinct (unique) count of words used. The black horizontal lines measure the average count of distinct words used for the artist & album.

It is immediately obvious that Iron Maiden’s Powerslave contains a range of more and less lexically diverse songs. Iron Maiden’s song “Rime of the Ancient Mariner” is by far an outlier in this visualisation, at 167 distinct words.

On the lower end of the scale, Daft Punk’s Discovery and Killswitch Engage’s Alive or Just Breathing appear to have a smaller vocabulary. Some human experiential insights from these musical compositions suggest that these albums are very engineered for audio experience. Daft Punk’s Discovery could be described as a dance record, while Killswitch Engage’s Alive or Just Breathing may have been designed to get the most out of vocally “growling” lyrics and so careful selection of words to achieve this purpose could be what we are observing in a lower vocabulary range.

In the mid-range of the lexical diversity we see Elton John, Led Zeppelin and U2; perhaps this is where the balance between audio experience and storytelling can be observed?

With this measure and perspective, we begin to see some similarities emerging between the artists and albums. The notion this is an eclectic grouping of music begins to have a counter argument.

 

f. Song lyrics self-similarity matrices (SongSim) & repetition

We have observed the lexical diversity for our collection of songs. Let’s now take a look at lyrical repetition within individual songs and observe for consistencies within and across albums.

Measuring and observing lyrical (or word) repetition is very relevant for this analysis as our context is music. Repetition can be observed in both instrumental waveforms and in lyrical structure. Ever had a song stuck in your head, repeating over and over?

The challenge here is to identify and use a visualisation which neatly and clearly describes repetition for our collection of songs.

Enter package songsim.

SongSim uses self-similarity matrices to visualise patterns of repetition in text. Each word (lyric) of a song forms a row and a column of the matrix. The cell at position (x, y) is filled in if the x-th and y-th words of the song are the same. For a more technical explanation check out the package author’s site here.

A self-similarity matrix is used to answer the question “which parts of this text thing are alike?”.

To get started, let’s setup our process flow for SongSim and take a look at Led Zeppelin’s Kashmir.

   

Most of Kashmir has very little lyrical repetition. This can be seen in the songsim matrix plot, where the “dots” form a weak and very sparse pattern. The lack of lyrical repetition is also observed by reading the lyric sheet. I find this a curious song to analyse. The element which strikes me the most is actually the sound of the song; it has a very repetitious component to it. When I listen to Kashmir it is this element of sound that I notice “gets stuck in my head”, and not the low-repetition lyrics! In the documentary film “It Might Get Loud”, Jimmy Page describes the guitar sound of Kashmir:

“…it has this riff which is circling around, then a cascade which goes over the top and hits this atonal point. It’s one of those real hypnotic riffs.”

Returning to the songsim matrix plot, as we progress down the black diagonal line toward the bottom right of the square, we begin to see some pattern “blobs” coloured in blue, purple and pink. These are attributed to the repeated lyrics:

Ooh, yeah-yeah, ooh, yeah-yeah, when I’m down…

Ooh, yeah-yeah, ooh, yeah-yeah, well I’m down, so down

Ooh, my baby, ooh, my baby, let me take you there

Let me take you there. Let me take you there

The “Colorful” mode of the SongSim matrix plot assigns a unique color to each repeated word (words appearing only once are black). When there are several repeated themes, this can make it easier to distinguish them.

The SongSim matrices also come with some handy parameters:

  • $songMat - the matrix structure for the SongSim plot
  • $repetitiveness - quantifies how repetitive a song is. It is a simple mean of the upper triangle of the matrix; the larger the value, the more repetitive the song.

We can create songsim matrices for our collection of songs and then compare repetitiveness scores as a method of assessing lyrical repetition (or lyrical density).
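The underlying idea can be sketched in a few lines of base R (illustrative only; the songsim package wraps this up and exposes it via $songMat and $repetitiveness):

```r
# Self-similarity matrix: TRUE where the i-th and j-th words match
song_matrix <- function(words) {
  outer(words, words, `==`)
}

# Repetitiveness: mean of the matrix's upper triangle
repetitiveness <- function(mat) {
  mean(mat[upper.tri(mat)])
}

lyrics <- c("ooh", "yeah", "yeah", "ooh", "yeah", "yeah")
m <- song_matrix(lyrics)
repetitiveness(m)   # 7 matching pairs out of 15, i.e. 7/15
```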

 

In this visualisation, each dot represents a song on the artist’s album, plotted by its lyrical repetitiveness score. The black horizontal lines measure the average song lyrical repetitiveness for the artist & album. A measure of 0.0 (0%) means each word in the song is unique, nothing is the same. A measure of 1.0 (100%) means the song is completely repetitive, i.e. a song with only one word used throughout the whole song.

Is it so surprising to see Daft Punk’s Discovery with the highest average lyrical repetitiveness? Interesting it also has the widest range of repetitiveness.

Curious to observe U2’s The Joshua Tree has the lowest average and the smallest range of lyrical repetitiveness. The grouping of songs appear to be very consistent in terms of lyrical repetition.

So what do all of these songs look like as SongSim matrices? We have an opportunity here to create some “data art”, and also observe each of these songs visually on one canvas.

Click here to see the SongSim poster we created for the collection of songs in this analysis.

 

Some notable mentions from the “SongSim Matrix Poster”:

  • Entire patterned square: Daft Punk’s “Harder, Better, Faster, Stronger” is a great example of visualised pop music. The chorus is basically the entire song. Daft Punk’s “One more time” and “Too long” also fit this description and pattern.
  • Small checkerboard-like patterns: Most of Elton John’s songs only have a small number of words within a repetition, see “Salvation” and “Hercules”.
  • Verses and Bridges “gutter” patterns: Most of Iron Maiden’s and U2’s songs appear to follow an intro - verse - chorus - verse - chorus - outro pattern.
  • Broken Diagonal patterns: Most of Killswitch Engage’s songs suggest a variation on the chorus or another major repeating section. Most of their songs are structured with verse - chorus - verse - chorus, but at the very end some words are moved around or swapped out.
  • Hybrid of “gutter” and “broken diagonal” patterns: Very much Led Zeppelin.
  • No two songs “look” the same. There may be similarities in terms of sections of patterns, but when lyric structure, lyric repetitiveness, and vocabulary are observed together, the songs are truly different from one another.

This has certainly uncovered some very interesting insights about how the songs in the collection are structured, and maybe even shed some light on the Artist’s preferences toward lyrical song writing.

The songsim matrices have done a great job of displaying how diverse each song (and therefore artist & album) is from a lyrical pattern and structure perspective. This gives strength to my hypothesis that the collection of music is eclectic!

 

g. Term Frequency Inverse Document Frequency (TF-IDF)

Let’s now address quantifying how important various lyrics (words) are in a song with respect to an album.

The Term Frequency - Inverse Document Frequency (TF-IDF for short) is the product of two measures:

Term Frequency * Inverse Document Frequency

  • The Term Frequency (TF): the number of times a word appears in a document

    Multiplied by

  • The Inverse Document Frequency (IDF): commonly the logarithm of the total number of documents divided by the number of documents that contain the word

With the TF and IDF combined, a word’s (or a lyric’s) importance is adjusted for how rarely it is used. The assumption with TF-IDF is that words that appear more frequently in a document (or a song) should be given a higher weighting, unless the word also appears in many documents (or songs).

For this analysis we can use TF-IDF to identify which words are important to each of the albums in the collection, compared across albums. We expect the albums to differ in terms of subject/topic, content and sentiment, and therefore expect the frequency of words to differ between albums; the TF-IDF metric will highlight these differences.

So let’s take a look at word importance through the lens of TF-IDF.
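One common way to compute this in R is tidytext’s bind_tf_idf(), run over word counts per album (the “document”). A minimal sketch with toy counts standing in for the real album word counts; note that tidytext uses the natural log for the IDF term:

```r
library(dplyr)
library(tidytext)

# Toy stand-in: word counts per album ("document")
album_words <- tibble::tibble(
  CATMusicAlbum = c("Discovery", "Discovery", "Powerslave", "Powerslave"),
  word          = c("harder",    "time",      "mariner",    "time"),
  n             = c(32L,         10L,         25L,          8L))

# "harder" and "mariner" appear in only one album each, so they score highest;
# "time" appears in both albums, so its IDF (and tf-idf) is zero
album_tf_idf <- album_words %>%
  bind_tf_idf(word, CATMusicAlbum, n) %>%
  arrange(desc(tf_idf))
```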

 

 

 

We see lots of familiar lyric words here which we can trace clearly back to individual songs. There are also specific narrative elements for individual songs. In Daft Punk’s “Harder, Better, Faster, Stronger”, for example, we can see almost the entire chorus structure (we do know it’s highly repetitive!). True to the nature of the TF-IDF metric, the words to this song are really only found in this song and very unlikely elsewhere.

Same goes for “rocket” in Elton John’s Rocket Man, “(two) minutes (to) midnight” in Iron Maiden, “numbered days” in Killswitch Engage, “Kashmir” in Led Zeppelin, and “red hill mining town” in U2. These are all curiously exclusive to the songs they are found in.

There is one small similarity with “time” appearing for Daft Punk and also for Led Zeppelin, in both instances it is rated low in significance, at < 0.1, compared to other words for both artists.

With respect to the hypothesis on an eclectic grouping of music, we can clearly see that the top words of significance for each artist are different and unique when compared across all artists.

This measure is performing as expected and doesn’t tell us anything new. However, this can still be useful information to be aware of prior to designing and training any models and exploring topic modelling.

Some questions we can ask before heading into sentiment analysis and machine learning:

  1. Do we need to perform more data preparation?
  2. Stemming: do we need to remove suffixes from words and reduce down to the common word origin? Is it appropriate?
  3. Lemmatization: do we need to remove inflectional endings of words, and return the base or dictionary form of a word (which is known as the lemma)?
  4. Advanced concept in sentiment analysis: Is it appropriate to simply replace some certain words with more frequently used synonyms (semantically similar peers) and/or hypernyms (common parents)? This would be used to address lexicon word matching challenges, between the words in the text versus the lexicon used.
  5. Do we need to construct our own lexicon for the sentiment analysis?

 

3. Sentiment Analysis & Natural Language Processing (NLP)

Sentiment analysis is a type of text mining which aims to determine the opinion and subjectivity of its content. When applied to song lyrics, the results can be representative of the artist’s attitude as well as their influences.

Natural Language Processing (NLP) is another methodology used in mining text. It tries to decipher the ambiguities in written language by tokenization, clustering, extracting entity and word relationships, and using algorithms to identify themes and quantify subjective information. This analysis in earlier sections touched on the basics of NLP, via exploring the lexical complexity of song lyrics (word frequencies, lexical diversity “vocabulary” and lexical density “repetition”).

 

a. NRC Emotional Sentiment

There are different methods which can be used for sentiment analysis. For this analysis we will explore our collection of songs using a predefined lexical dictionary (lexicon) named NRC.

The NRC lexicon assigns words into one or more of ten categories: positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

We will use the tidytext package and call in the NRC lexicon by using the get_sentiments() function when creating our data frame.

Let’s observe how the NRC lexicon triages our song lyric words into the emotional sentiments.
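The join itself looks like the following sketch; toy stand-ins are used here so the snippet is self-contained, whereas the real analysis uses wordToken and tidytext’s get_sentiments("nrc") (which may prompt to download the lexicon via the textdata package):

```r
library(dplyr)

# Toy stand-in for the NRC lexicon (real source: tidytext::get_sentiments("nrc"))
nrc <- tibble::tibble(
  word      = c("love", "love",     "fear"),
  sentiment = c("joy",  "positive", "fear"))

# Toy stand-in for wordToken
wordToken <- tibble::tibble(
  CATMusicAlbum = "Discovery",
  word = c("love", "digital", "love"))

# inner_join keeps one row per matching word-sentiment pair
nrc_words <- wordToken %>%
  inner_join(nrc, by = "word")

sentiment_counts <- nrc_words %>%
  count(CATMusicAlbum, sentiment, sort = TRUE)
```

Note a word can carry several NRC categories at once (here “love” maps to both joy and positive), so word rows multiply in the join.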

 

 

This is a very dense visualisation; it is difficult to distinguish any clear patterns.

Let’s summarise this down to word counts by emotional sentiment.

 

 

So how do these albums make us feel when we listen to them?

I have always suspected Daft Punk’s Discovery to be jovial and energetic. I am glad to see anticipation, joy and trust rate highly for this album.

Equally for Killswitch Engage’s Alive or Just Breathing: this album tells stories with much sadness, anger and fear, so it is not very surprising these emotions rate highly.

 

b. NGrams, bi-grams & tri-grams

Earlier in this analysis we explored single word (or unigram) frequency counts. This section is dedicated to exploring what precedes and follows the most common words we have identified in our collection of songs.

We have the option at this point to remove stop words and undesirable words before calculating the bi-grams and tri-grams. For this piece, only the undesirable words have been removed. It would be interesting to observe the “direction” of the song lyrics: are songs written as one person toward another, a group of people, us versus them, a collective “we”, or a directive “you”? Let’s take a look!
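The n-gram tokenisation itself is a one-argument change to unnest_tokens(); a sketch with a toy stand-in for the lineToken data frame of raw lyric lines:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the lineToken data frame of raw lyric lines
lineToken <- tibble::tibble(
  CATMusicArtist = "Daft Punk",
  text = "one more time one more time")

# token = "ngrams" with n = 2 (bi-grams) or n = 3 (tri-grams)
bigrams <- lineToken %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(CATMusicArtist, bigram, sort = TRUE)

trigrams <- lineToken %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(CATMusicArtist, trigram, sort = TRUE)
```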

 

 

 

The lasting “sentiments” from observing these bi-grams and tri-grams:

  • Daft Punk - The lyrical repetitiveness is very obvious, strongest of all the artists. Ngram structure points toward “being in the moment” and self-enjoyment.
  • Elton John - Less ngram frequency, with a common thread of thankfulness (“thank”) and chill (“mellow”)
  • Iron Maiden - Strong themes of returning to a communal “village” and anticipating “midnight”
  • Killswitch Engage - Written from the experiences of the first person (“i am”, “me in”) and observing the passing of “time”
  • Led Zeppelin - The writer’s observations of relationships with people (“you didn’t”, “hey mama”, “love”)
  • U2 - The writer’s observations of life journeys, observations with people (“haven’t found”, “bullet the blue”, “you give”) and foreign places (“streets”)

 

c. Bi-gram network analysis

From our Bi-Gram constructs we can create a network graph using the ggraph and igraph packages. We can arrange words into connected nodes, with selected “centering words” at the centers.

From our earlier inspection of unigram word frequencies, the following high-frequency “centering words” have been selected:

"hey","feel","gonna","people","yeah","love","light","time", "life"

Our dataset will be grouped up, to represent all songs, for all artists and albums.
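A sketch of the network construction, assuming bi-gram counts like those computed in the previous section (here replaced by a toy stand-in); the layout and aesthetics are illustrative choices:

```r
library(dplyr)
library(tidyr)
library(igraph)
library(ggraph)

# Toy stand-in for the bi-gram counts from the previous section
bigrams <- tibble::tibble(
  bigram = c("hey mama", "feel love", "you feel", "hey yeah"),
  n      = c(5L, 3L, 2L, 2L))

centers <- c("hey", "feel", "gonna", "people", "yeah",
             "love", "light", "time", "life")

# Split bigrams into word pairs, keep those touching a centering word,
# then build the graph (first two columns become edges, n an edge attribute)
bigram_graph <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% centers | word2 %in% centers) %>%
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```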

 

This is just a simple network graph to demonstrate the various methods we can use to visualise our song lyric data.

For any songwriters out there, perhaps this could be useful in identifying lyric constructs?

 

d. Pair-wise comparisons

Which songs are similar to each other in lyrical content? We can explore this by finding the pairwise correlation of lyric (word) frequencies within each song, using the pairwise_cor() function from the widyr package.

The assumption here is that the higher the correlation factor, the higher the similarity between songs.

We will remove stop words for this analysis, to allow us to observe more meaningful results.
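The pairwise_cor() step looks like the following sketch, with a toy stand-in for wordToken; it computes the phi coefficient between songs based on which words they share:

```r
library(dplyr)
library(widyr)

# Toy stand-in for wordToken (stop words removed)
wordToken <- tibble::tibble(
  CATTrackName = c("Song A", "Song A", "Song B", "Song B", "Song C"),
  word         = c("love",   "time",   "love",   "time",   "kashmir"))

# Correlation between songs based on shared word occurrence;
# Song A and Song B use exactly the same words, so they correlate perfectly
song_cors <- wordToken %>%
  pairwise_cor(CATTrackName, word, sort = TRUE)

song_cors %>% filter(correlation > 0.4)
```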

 

 

Well this is curious! We created a pairwise correlation plot using the igraph package and the graph_from_data_frame() function with our input dataset. We filtered our dataset to show only correlations stronger than 0.4.

From this it appears only Elton John’s Mona Lisas and Mad Hatters and Led Zeppelin’s Down by the Seaside have a connection with a correlation greater than 0.4. Does this mean they are similar songs?

Interesting that no other songs had a high enough correlation in the pairwise calculation to be considered as “significant”.

 

e. Album similarity

Using the qdap package we have access to a function called trans_venn(). Since we just observed pairwise correlations between songs, let’s take a look at similarity between albums, visualised as a venn diagram.

It appears U2’s The Joshua Tree is a versatile linking centroid in this venn diagram. Can the other albums realistically be “linked” together via The Joshua Tree?

 

f. Song dissimilarity (agreement between lyrics)

Just as we compared albums for similarity in the previous section, we can do something similar for songs. Still using the qdap package, the Dissimilarity() function uses the distance function to calculate dissimilarity statistics by grouping variables.

The Dissimilarity() function returns a matrix of dissimilarity values, measuring the agreement between texts, or songs. We will plot this matrix as a dendrogram and identify some potential clusters.
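The dendrogram-and-clusters idea can be sketched with base R alone (qdap’s Dissimilarity() is described as building on the distance function); the toy song-by-word count matrix here is purely illustrative:

```r
# Toy stand-in: a song-by-word count matrix (rows = songs)
song_word_matrix <- matrix(
  c(3, 1, 0, 0,
    2, 2, 0, 0,
    0, 0, 4, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Song A", "Song B", "Song C"),
                  c("love", "time", "kashmir", "woman")))

# Distance between songs, then hierarchical clustering
d  <- dist(song_word_matrix)
hc <- hclust(d)

# Plot the dendrogram and draw rectangles around k clusters
# (the real analysis uses k = 14)
plot(hc, main = "Song dissimilarity dendrogram")
rect.hclust(hc, k = 2, border = c("green", "purple"))
```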

 

This dendrogram offers an alternative view of how similar and different the individual songs in the collection are. We configured the dendrogram to identify 14 clusters and illustrate the borders of these clusters with the coloured rectangles.

It is curious that there is one large cluster in the middle, illustrated by the green rectangle.

 

4. Unsupervised Machine Learning

 

a. Topic modelling: Structured Topic Modelling (STM)

Topic modelling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for or what our target is.

Some questions for guiding this section are:

  • Can we identify any meaningful groups or themes in our collection of songs?
  • Which words (or lyrics) contribute to which topics?
  • Which topics contribute to which albums?

For this piece, we will use Structured Topic Modelling (STM) and make use of the packages quanteda and stm.

First step is to prepare our input dataset:

 

#Load the libraries
library(dplyr)
library(tidytext)
library(quanteda)
library(stm)

# Using our lineToken dataset from earlier, un-nest by single words
# Then we will remove stop words and filter out any more undesirable words
tidy_MusicLyrics <- lineToken %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") # %>%
  #filter(word != "")

# check the output from tidy_MusicLyrics and identify any further words to filter out above, rerun the above step if needed.
tidy_MusicLyrics %>%
  count(word, sort = TRUE)
## # A tibble: 90 x 2
##    word         n
##    <chr>    <int>
##  1 time        49
##  2 faster      32
##  3 harder      32
##  4 stronger    32
##  5 ancient     25
##  6 mariner     25
##  7 rime        25
##  8 dying       23
##  9 foot        21
## 10 trampled    21
## # ... with 80 more rows
# Create a Quanteda DFM object, ready to use in the STM
MusicLyrics_dfm <- tidy_MusicLyrics %>%
  count(CATMusicArtist, word, sort = TRUE) %>%
  cast_dfm(CATMusicArtist, word, n)

One of the drawbacks of STM is the need to select the number of topics “K” to train the model with. Fortunately, the stm package comes with lots of functions and support for choosing an appropriate number of topics for the model.
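One such helper is searchK(), which fits a model per candidate K and reports diagnostics (held-out likelihood, semantic coherence, residuals). A sketch, assuming the MusicLyrics_dfm object built above; quanteda::convert() turns the dfm into stm’s native input format:

```r
library(quanteda)
library(stm)

# Convert the dfm to stm's input format, then compare candidate values of K
out <- convert(MusicLyrics_dfm, to = "stm")

k_search <- searchK(out$documents, out$vocab,
                    K = c(5, 10, 15, 20), verbose = FALSE)

# One diagnostic panel per metric, plotted against K
plot(k_search)
```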

With our DFM object prepared and ready, we can proceed to running the STM model with parameters init.type = "Spectral" and K = 0. With K = 0, the algorithm determines an appropriate number of topics itself. From here we can further analyse and refine how many topics would be appropriate to set K to.

The Spectral initialization uses a decomposition of the VxV word co-occurrence matrix to identify “anchor” words, which are words that belong to only one topic and therefore identify that topic. The topic loadings of other words are then calculated based on these anchor words. This process is deterministic, so the same values are always reached with the same VxV matrix. The problem is that this process assumes the VxV matrix is generated from a population of an infinite number of documents (or in our case, songs), so it does not behave well with infrequent/rare words. The solution is to remove infrequent words, although we still need to be careful in situations where we don’t have a lot of documents; in our case we have 59 songs across 6 albums.

## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Stemming... 
## Creating Output...
# Train the model with K = 0, i.e. no predefined number of topics. Let's observe how many topics emerge:
STM_topic_model <- stm(MusicLyrics_dfm, K = 0, seed=12345, verbose = FALSE, init.type = "Spectral")

Training the model with K = 0 identifies 32 topics. We can use labelTopics() to describe the common words associated with each of these 32 topics.

labelTopics(STM_topic_model)
## Topic 1 Top Words:
##       Highest Prob: balance, temple, element, breathing, darkness, fixation, life 
##       FREX: balance, trip, bullet, sky, streets, mining, tree 
##       Lift: balance, temple, element, breathing, darkness, fixation, life 
##       Score: balance, temple, element, breathing, darkness, fixation, life 
## Topic 2 Top Words:
##       Highest Prob: barely, temple, element, breathing, darkness, fixation, life 
##       FREX: barely, trip, bullet, sky, streets, mining, tree 
##       Lift: barely, temple, element, breathing, darkness, fixation, life 
##       Score: barely, temple, element, breathing, darkness, fixation, life 
## Topic 3 Top Words:
##       Highest Prob: black, dying, trampled, kashmir, time, rover, flight 
##       FREX: black, dying, trampled, kashmir, time, rover, flight 
##       Lift: black, kashmir, trampled, dying, rover, flight, woman 
##       Score: black, dying, trampled, kashmir, rover, flight, woman 
## Topic 4 Top Words:
##       Highest Prob: blue, hill, mining, town, red, bullet, trip 
##       FREX: blue, hill, mining, town, red, bullet, trip 
##       Lift: blue, stand, found, tree, bullet, trip, sky 
##       Score: blue, hill, mining, town, red, bullet, trip 
## Topic 5 Top Words:
##       Highest Prob: broken, temple, element, breathing, darkness, fixation, life 
##       FREX: broken, trip, bullet, sky, streets, mining, tree 
##       Lift: broken, temple, element, breathing, darkness, fixation, life 
##       Score: broken, temple, element, breathing, darkness, fixation, life 
## Topic 6 Top Words:
##       Highest Prob: country, hill, town, mining, red, bullet, sky 
##       FREX: country, bullet, sky, streets, trip, tree, found 
##       Lift: country, hill, town, mining, red, bullet, sky 
##       Score: country, hill, town, mining, red, bullet, sky 
## Topic 7 Top Words:
##       Highest Prob: custard, dying, trampled, kashmir, rover, flight, woman 
##       FREX: custard, dying, trampled, kashmir, rover, time, flight 
##       Lift: custard, kashmir, trampled, dying, rover, flight, woman 
##       Score: custard, dying, trampled, kashmir, rover, flight, woman 
## Topic 8 Top Words:
##       Highest Prob: temple, days, element, breathing, darkness, fixation, life 
##       FREX: days, temple, element, breathing, darkness, fixation, life 
##       Lift: days, breathing, darkness, fixation, life, lifeless, revolution 
##       Score: days, temple, element, breathing, darkness, fixation, life 
## Topic 9 Top Words:
##       Highest Prob: disappeared, hill, mining, red, town, bullet, trip 
##       FREX: disappeared, hill, mining, red, town, bullet, trip 
##       Lift: disappeared, hill, mining, red, town, bullet, trip 
##       Score: disappeared, hill, mining, red, town, bullet, trip 
## Topic 10 Top Words:
##       Highest Prob: dramas, it’s, rocket, mona, salvation, amy, cat 
##       FREX: dramas, it’s, rocket, mona, salvation, amy, cat 
##       Lift: dramas, it’s, rocket, mona, salvation, amy, cat 
##       Score: dramas, it’s, rocket, mona, salvation, amy, cat 
## Topic 11 Top Words:
##       Highest Prob: exit, hill, mining, red, town, sky, streets 
##       FREX: exit, hill, mining, red, town, sky, streets 
##       Lift: exit, stand, found, tree, sky, streets, trip 
##       Score: exit, hill, mining, red, town, sky, streets 
## Topic 12 Top Words:
##       Highest Prob: foot, time, dying, trampled, kashmir, rover, flight 
##       FREX: foot, time, dying, trampled, kashmir, rover, flight 
##       Lift: foot, time, pie, boogie, stu, song, wanton 
##       Score: foot, time, dying, trampled, kashmir, rover, flight 
## Topic 13 Top Words:
##       Highest Prob: god's, hill, red, town, mining, bullet, trip 
##       FREX: god's, hill, red, town, mining, bullet, trip 
##       Lift: god's, stand, found, tree, bullet, trip, sky 
##       Score: god's, hill, red, town, mining, bullet, trip 
## Topic 14 Top Words:
##       Highest Prob: infra, temple, element, breathing, darkness, fixation, life 
##       FREX: infra, trip, bullet, sky, streets, mining, tree 
##       Lift: infra, temple, element, breathing, darkness, fixation, life 
##       Score: infra, temple, element, breathing, darkness, fixation, life 
## Topic 15 Top Words:
##       Highest Prob: inside, temple, element, breathing, darkness, fixation, life 
##       FREX: inside, trip, bullet, sky, streets, mining, tree 
##       Lift: inside, temple, element, breathing, darkness, fixation, life 
##       Score: inside, temple, element, breathing, darkness, fixation, life 
## Topic 16 Top Words:
##       Highest Prob: faster, harder, stronger, time, digital, kill, love 
##       FREX: faster, harder, stronger, digital, kill, love, time 
##       Lift: digital, kill, love, faster, harder, stronger, time 
##       Score: faster, harder, stronger, kill, time, digital, love 
## Topic 17 Top Words:
##       Highest Prob: light, dying, trampled, kashmir, time, rover, flight 
##       FREX: light, dying, trampled, kashmir, time, rover, flight 
##       Lift: light, kashmir, trampled, dying, rover, flight, woman 
##       Score: light, dying, trampled, kashmir, rover, flight, woman 
## Topic 18 Top Words:
##       Highest Prob: mad, rocket, it’s, amy, cat, hatters, hercules 
##       FREX: mad, rocket, it’s, amy, cat, hatters, hercules 
##       Lift: mad, amy, cat, hatters, hercules, honky, lisas 
##       Score: mad, rocket, it’s, amy, cat, hatters, hercules 
## Topic 19 Top Words:
##       Highest Prob: mellow, time, rocket, it’s, mona, salvation, amy 
##       FREX: mellow, time, mona, salvation, amy, cat, hatters 
##       Lift: mellow, time, mona, salvation, amy, cat, hatters 
##       Score: mellow, time, rocket, it’s, mona, salvation, amy 
## Topic 20 Top Words:
##       Highest Prob: mothers, hill, mining, town, red, bullet, sky 
##       FREX: mothers, hill, mining, town, red, bullet, sky 
##       Lift: mothers, hill, mining, town, red, bullet, sky 
##       Score: mothers, hill, mining, town, red, bullet, sky 
## Topic 21 Top Words:
##       Highest Prob: night, dying, trampled, kashmir, time, rover, flight 
##       FREX: night, dying, trampled, kashmir, time, rover, flight 
##       Lift: night, kashmir, trampled, dying, rover, flight, woman 
##       Score: night, dying, trampled, kashmir, rover, flight, woman 
## Topic 22 Top Words:
##       Highest Prob: temple, numbered, element, breathing, darkness, fixation, life 
##       FREX: numbered, temple, element, breathing, darkness, fixation, life 
##       Lift: numbered, breathing, darkness, fixation, life, lifeless, revolution 
##       Score: numbered, temple, element, breathing, darkness, fixation, life 
## Topic 23 Top Words:
##       Highest Prob: rise, temple, element, breathing, darkness, fixation, life 
##       FREX: rise, trip, sky, streets, bullet, mining, tree 
##       Lift: rise, temple, element, breathing, darkness, fixation, life 
##       Score: rise, temple, element, breathing, darkness, fixation, life 
## Topic 24 Top Words:
##       Highest Prob: running, hill, town, red, mining, sky, streets 
##       FREX: running, hill, town, red, mining, sky, streets 
##       Lift: running, hill, town, red, mining, sky, streets 
##       Score: running, hill, town, red, mining, sky, streets 
## Topic 25 Top Words:
##       Highest Prob: seaside, time, dying, trampled, kashmir, rover, flight 
##       FREX: seaside, dying, time, trampled, kashmir, rover, flight 
##       Lift: seaside, pie, boogie, stu, wanton, song, woman 
##       Score: seaside, dying, trampled, time, kashmir, rover, flight 
## Topic 26 Top Words:
##       Highest Prob: serenade, temple, element, breathing, darkness, fixation, life 
##       FREX: serenade, trip, bullet, sky, streets, mining, tree 
##       Lift: serenade, temple, element, breathing, darkness, fixation, life 
##       Score: serenade, temple, element, breathing, darkness, fixation, life 
## Topic 27 Top Words:
##       Highest Prob: sick, dying, trampled, time, kashmir, rover, flight 
##       FREX: sick, dying, trampled, time, kashmir, rover, flight 
##       Lift: sick, pie, boogie, stu, song, wanton, woman 
##       Score: sick, dying, trampled, kashmir, rover, time, flight 
## Topic 28 Top Words:
##       Highest Prob: slave, it’s, rocket, mona, salvation, amy, cat 
##       FREX: slave, it’s, rocket, mona, salvation, amy, cat 
##       Lift: slave, it’s, rocket, mona, salvation, amy, cat 
##       Score: slave, it’s, rocket, mona, salvation, amy, cat 
## Topic 29 Top Words:
##       Highest Prob: susie, it’s, rocket, mona, salvation, amy, cat 
##       FREX: susie, it’s, rocket, mona, salvation, amy, cat 
##       Lift: susie, it’s, rocket, mona, salvation, amy, cat 
##       Score: susie, it’s, rocket, mona, salvation, amy, cat 
## Topic 30 Top Words:
##       Highest Prob: ten, dying, trampled, kashmir, time, rover, flight 
##       FREX: ten, dying, trampled, kashmir, time, rover, flight 
##       Lift: ten, kashmir, trampled, dying, rover, flight, woman 
##       Score: ten, dying, trampled, kashmir, rover, flight, woman 
## Topic 31 Top Words:
##       Highest Prob: ancient, mariner, rime, village, midnight, minutes, aces 
##       FREX: ancient, mariner, rime, village, midnight, minutes, aces 
##       Lift: aces, village, ancient, blade, duellists, flash, mariner 
##       Score: ancient, mariner, rime, village, midnight, minutes, aces 
## Topic 32 Top Words:
##       Highest Prob: wires, hill, mining, town, red, trip, sky 
##       FREX: wires, hill, mining, town, red, trip, sky 
##       Lift: wires, stand, found, tree, trip, sky, streets 
##       Score: wires, hill, mining, town, red, trip, sky

We can now use plot.STM() and observe how common each topic is:

stm::plot.STM(STM_topic_model, type = "summary", xlim = c(0, 0.1))

Exploring semantic coherence and exclusivity for each topic using the function stm::topicQuality():

  • Semantic coherence measures the empirical co-occurrence of a topic’s high-probability words: if the word “apple” appears in a document, how likely is the word “banana” to appear alongside it? A topic’s coherence is the sum of the logs of these co-occurrence probabilities.
  • Exclusivity measures how unlikely a topic’s top words are to appear among the top words of other topics.
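The coherence and exclusivity vectors printed below can be produced with a call along these lines (a sketch, assuming the same model object and dfm fitted above; depending on your stm version you may first need to convert the dfm with quanteda::convert(..., to = "stm")):

```r
# Plot exclusivity against semantic coherence, one point per topic
# (assumes STM_topic_model and MusicLyrics_dfm from the earlier steps).
stm::topicQuality(model = STM_topic_model, documents = MusicLyrics_dfm)

# The two metrics are also available individually:
stm::semanticCoherence(STM_topic_model, MusicLyrics_dfm)  # one value per topic
stm::exclusivity(STM_topic_model)                         # one value per topic
```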
##  [1] -274.6426 -280.0723 -284.9488 -309.0505 -280.0723 -310.3935 -277.7341
##  [8] -280.0723 -310.3935 -303.7534 -310.3935 -277.7341 -310.3935 -280.0723
## [15] -280.0723 -319.9913 -277.7341 -304.2346 -308.3448 -310.3935 -277.7341
## [22] -280.0723 -280.0723 -310.3935 -277.7341 -280.0723 -277.7341 -306.6729
## [29] -306.6729 -277.7341 -303.4867 -310.3935
##  [1] 9.500000 9.500000 9.474528 9.353203 9.500000 9.500000 9.475682
##  [8] 9.496062 9.500000 9.472222 9.353203 9.306970 9.353203 9.500000
## [15] 9.500000 9.380764 9.474528 9.474647 9.420263 9.500000 9.474528
## [22] 9.496062 9.500000 9.500000 9.281457 9.500000 9.281718 9.472209
## [29] 9.472209 9.474528 9.414610 9.353203

From our initial 32 identified topics we can begin to see some groups of topics emerging. Perhaps we can refine these 32 topics down to a smaller number; 6 seems more reasonable.
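For a more systematic way of choosing the number of topics (not part of the original run), stm provides searchK(), which fits models across candidate K values and compares diagnostics such as held-out likelihood, coherence and exclusivity. A sketch, assuming the same dfm:

```r
# Convert the quanteda dfm to stm's documents/vocab format,
# then sweep over a few candidate topic counts.
stm_docs <- quanteda::convert(MusicLyrics_dfm, to = "stm")
k_search <- stm::searchK(stm_docs$documents, stm_docs$vocab,
                         K = c(4, 6, 8, 10), init.type = "Spectral")
plot(k_search)  # one diagnostic panel per metric, across K
```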

Let’s train our model on 6 topics, setting the parameter K = 6.

STM_topic_model <- stm(MusicLyrics_dfm, K = 6, seed=12345, verbose = FALSE, init.type = "Spectral")

 

We can now observe and visualise the output from our trained model across the 6 topics.

Using the beta matrix we can see which words contribute the most to each topic.

 

td_BetaMatrix <- tidy(STM_topic_model)

td_BetaMatrix %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  #scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Beta Matrix: Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics")

 

Our STM model also has another metric we can use and observe, the gamma matrix. This is the probability that each “document” is generated from each topic.

Questions we can observe here are:

  • How much did this document contribute to the topic?
  • How likely is this document to belong to this topic?

 

td_GammaMatrix <- tidy(STM_topic_model, matrix = "gamma",
                       document_names = rownames(MusicLyrics_dfm))

ggplot(td_GammaMatrix, aes(gamma, fill = as.factor(topic))) +
  geom_histogram(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 3) +
  labs(title = "Gamma Matrix: Distribution of document probabilities for each topic",
       subtitle = "Each song is strongly associated with a single topic",
       y = "Number of songs", x = expression(gamma))

 

From these results it seems each song is strongly associated with a single topic. This is somewhat uncommon; topic modelling does not always work out this way. Keep in mind that we built this model from only a small number of “documents” and a small vocabulary. Still, it was a good exercise in topic modelling and in interpreting beta and gamma results.

Observing the results from the beta matrix, the topics, and music artist influences:

  • Topic 1 - hybrid of Daft Punk & Elton John
  • Topic 2 - strongly suggests Iron Maiden
  • Topic 3 - faintly describes a mix of Iron Maiden and Killswitch Engage
  • Topic 4 - appears to be mostly Led Zeppelin
  • Topic 5 - a mix of Iron Maiden, Elton John
  • Topic 6 - distinctively U2
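To sanity-check these artist attributions, we could list each topic’s highest-gamma songs (a sketch reusing td_GammaMatrix from above; slice_max() is from dplyr):

```r
# For each of the 6 topics, show the songs most strongly associated with it.
td_GammaMatrix %>%
  dplyr::group_by(topic) %>%
  dplyr::slice_max(gamma, n = 3) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, dplyr::desc(gamma))
```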

 

5. Findings & Learnings

 

a. Findings

Returning to our guiding questions:

Can we make meaningful sense of the data? Yes! Across the selected 6 music albums and 59 individual songs identified:

  • Each song is distinct from the next, as we saw in the SongSim matrices, pair-wise correlations and the song-similarity dendrogram.
  • Each album, and the collection of songs within it, does suggest common overarching themes, as we saw in the NRC emotional sentiments.
  • Some albums share similar themes, as we saw in the STM model and in the Venn diagram.
  • Each artist (and album) has a particular preference in lyrical structure and word choice, as we saw in the SongSim matrices and Ngrams.

Is it safe to say that this collection of songs and albums is an example of an eclectic collection? Perhaps. Although, in reality, humans can find ways of proving and disproving similarity using a variety of configurable techniques.

Having never listened to Iron Maiden’s Powerslave album, this analysis has provided me with a much clearer understanding of what stories, derived meanings, and sentiments this album contains. That’s pretty cool!

Here’s how this analysis has helped me make sense of this collection of music data:

  • Daft Punk’s Discovery. There is lyrical evidence this album was engineered for sound and audio experience rather than wordy storytelling: low lexical diversity, high lyrical repetition, emotional sentiment pointing well into the positive, and a high count of instrumental tracks on the album. The songs which do have lyrics tend to centre on one or two key words and are examples of songs that get stuck in our heads.
  • Killswitch Engage’s Alive or Just Breathing. This album shows signs of being engineered for a different kind of sound experience to Daft Punk’s. It has a similarly low lexical diversity but an emotional sentiment well into the negative. Keyword emphasis also differs, suggesting that the vocal “growling” of lyrics, and the careful word selection needed to achieve it, may be what we are observing in the lower vocabulary range.
  • Elton John’s Honky Château.
  • Led Zeppelin’s Physical Graffiti.
  • U2’s The Joshua Tree.
  • Iron Maiden’s Powerslave.

 

b. Learnings, gotchas, traps for young players

Two key lessons I picked up on preparing this analysis:

  1. Text Mining, NLP, Sentiment Analysis and Topic Modelling are all large subjects in their own right. One can easily go into great depth on each of them; this analysis only scratched the surface. The downside of going deeper is that one must factor in more research, analysis and development time, while still circling back to the big questions: “What is the purpose of this analysis?” and “What is the problem we are trying to solve, and why does it matter?”

  2. Consider, very carefully, the technical and narrative flow of the analysis for the reader’s (and developer’s) benefit. This analysis was intended to start with simple data clean-up and descriptive stats, then progress to more intermediate and advanced techniques, building with the “lego blocks” of data and constructs along the way.

 

c. Where to next & part 2

There were a few elements in this analysis we could unpack further, such as:

  • Song sentiment progression from first to last verse (by “sentence”).
  • What would happen if we applied word stemming, lemmatization, synonyms and hypernyms?
  • Including all albums released by the artists selected for this analysis. How much difference would adding more “documents” make?

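As a taste of the stemming idea above, quanteda can Porter-stem an existing dfm in one step (a sketch; lemmatization, synonyms and hypernyms would need additional resources such as a lemma dictionary):

```r
# Collapse inflected forms (e.g. "running", "runs" -> "run") and re-fit.
MusicLyrics_dfm_stemmed <- quanteda::dfm_wordstem(MusicLyrics_dfm)
STM_topic_model_stemmed <- stm::stm(MusicLyrics_dfm_stemmed, K = 6,
                                    seed = 12345, verbose = FALSE,
                                    init.type = "Spectral")
```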
Given this piece is all about music, we covered only one aspect here: the lyrical constructs. To make the analysis more holistic, we will consider the audio components of music in a subsequent piece.

 

I’d love to receive your thoughts, queries and feedback on this analysis. Please feel free to reach out to Bree at bree_mclennan@outlook.com.

 

Thanks for connecting and reading!

Bree.

 

6. References

Many hours have been spent researching approaches for designing and writing this analysis. Here’s some items I’d like to share:

Inspirations for this piece

  • TV series “It might get loud” with The Edge (U2), Jimmy Page (Led Zeppelin), Jack White (The White Stripes)
  • Game: Audiosurf (visualising sound, beat detection algorithms and digital signal processing)

Text mining, NLP and Machine Learning with Music Lyrics

Prince analysis - NLP

Prince analysis - Sentiments

The Ramones

Rick and Morty

50 Years of Pop Music Lyrics

Radiohead & Using the Spotify API

Alternative sentiment analysis: Using the “gloom” index to find depressing songs

Alternative visualisations: Visualising songs as matrix structures and find repetitions

Topic Modeling of Sherlock Holmes Stories

Topic modeling using STM